fix: filter non-productive TCRs, dynamic gene family columns, single-… by KevinMLanderos · Pull Request #82 · KarchinLab/TCRtoolkit

KevinMLanderos · 2026-05-15T20:24:59Z

Three bugs were fixed:

bin/sample_calc.py — Gene family CSVs (v_family, d_family, j_family) were built with hard-coded maximum indices (TRBV: 30, TRBD: 2, TRBJ: 2), silently dropping any gene with a number above those limits. The max index is now derived dynamically from genes observed in each sample. Samples with no valid calls for a gene type write a sample-only row instead of crashing.
modules/local/annotate/main.nf — ANNOTATE_PROCESS did not filter for productive TCRs, so non-productive rearrangements propagated into every downstream file: concatenated_cdr3_sorted.tsv, OLGA pgen calculation, TCRSHARING, and the patient workflow (GIANA, GLIPH2, overlap metrics). The process now reads the 'productive' column when present and retains only productive entries before writing per-sample _cdr3.tsv files, fixing all downstream analyses in one place.
bin/pseudobulk.py — Cell Ranger AIRR output was not filtered for cell or contig quality before pseudobulking. is_cell, high_confidence, and productive filters are now applied in both pseudobulk() and pseudobulk_phenotype() when those columns are present, ensuring background barcodes, low-confidence assemblies, and non-productive contigs are excluded from single-cell input.
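
The two ideas behind these fixes, filtering on quality columns only when they exist and sizing the gene family columns from the observed data, can be sketched as below. This is a minimal illustration and not the code in this PR: `filter_rows`, `family_counts`, the accepted truthy spellings, and the `TRBV<n>` column naming are all hypothetical, and it uses plain dictionaries instead of whatever data structures the pipeline scripts use.

```python
import re
from collections import Counter

# Assumption: boolean flags may be spelled "T"/"F" or "true"/"false".
TRUE_VALUES = {"t", "true", "1", "yes"}


def is_true(value):
    """Interpret a flag cell as a boolean."""
    return str(value).strip().lower() in TRUE_VALUES


def filter_rows(rows, flag_columns=("is_cell", "high_confidence", "productive")):
    """Keep rows whose flag columns are all true; absent columns are ignored."""
    if not rows:
        return []
    # Only columns that exist in the input are applied as filters.
    present = [col for col in flag_columns if col in rows[0]]
    return [row for row in rows if all(is_true(row[col]) for col in present)]


def family_counts(sample, gene_calls, prefix):
    """Count gene families, with columns up to the highest index observed."""
    pattern = re.compile(rf"^{re.escape(prefix)}(\d+)")
    indices = []
    for call in gene_calls:
        match = pattern.match(call or "")
        if match:
            indices.append(int(match.group(1)))

    row = {"sample": sample}
    if not indices:
        # No valid calls for this gene type: return a sample-only row.
        return row

    counts = Counter(indices)
    for index in range(1, max(indices) + 1):
        row[f"{prefix}{index}"] = counts.get(index, 0)
    return row


# Made-up example data, for illustration only.
rows = [
    {"junction_aa": "CASSF", "productive": "T", "v_call": "TRBV31*01"},
    {"junction_aa": "CAS*F", "productive": "F", "v_call": "TRBV5-1*01"},
]
kept = filter_rows(rows)
v_row = family_counts("sample_1", [r["v_call"] for r in kept], "TRBV")
```

Here the non-productive row is dropped before counting, and `v_row` gains a `TRBV31` column that a fixed limit of 30 would have left out.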

…cell quality filters

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

github-actions · 2026-05-15T20:25:57Z

Unit Test Results

10 tests, 2 suites, 1 file (21s ⏱️): 2 ✅ passed, 0 💤 skipped, 8 ❌ failed

For more details on these failures, see this check.

Results for commit 389a328.

KevinMLanderos wants to merge 1 commit into main from fix_details
